Main Questions
Our main questions:
- How have youth disconnection rates in American youth changed since 2008?
- In particular, how has this changed for different gender and ethnic groups? Are any groups particularly disconnected?
MICHAEL ONTIVEROS
There are some existing typos in the tables in addition to those produced during the use of Magick. I made some assumptions along the way, documenting them. These need to be checked for accuracy.
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study please use:
Wright, Carrie, and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-youth-disconnection-case-study. Disparities in Youth Disconnection (Version v1.0.0).
According to this report youth disconnection although generally showing decreasing trends for the past 7 years, shows racial and ethnic disparities, where some groups are showing increased rates of disconnection.
So what does the term “youth disconnection” mean?
According to Measure of America (a nonpartisan project of the nonprofit Social Science Research Council that is focused on opportunity in the United States) disconnected youth are:
“young people between the ages of 16 and 24 who are neither working nor in school”
They state that such disconnection hinders these individuals to aquire skills and create relationships necessary to have a sucessful adulthood.
They state that:
“people who experience a period of disconnection as young adults go on to earn less and are less likely to be employed, own a home, or report good health by the time they reach their thirties”
Disconnected youth are also referred to as opportunity youth, which has the added positive connotation that promoting such individuals can be beneficial not only for these individuals but also for their communties and for society.
We will expand beyond the Measure of America annual report to take a deeper look at differences of specific groups of youths. Identifying youths particularly at risk or disconnected, can help inform the design of targeted prevention and rengagement strategies.
This case study is motivated by this article:
Mendelson, T., Mmari, K., Blum, R. W., Catalano, R. F. & Brindis, C. D. Opportunity Youth: Insights and Opportunities for a Public Health Approach to Reengage Disconnected Teenagers and Young Adults. Public Health Rep 133, 54S-64S (2018).
This article describes strategies for prevention of disconnection and reengagement of discconnected youth and how such interventions could greatly positively impact opportunity youth for the entire trajectory of their lives and for future generations. It also points out that indeed their are disparities among different racial/ethnic groups.
Our main questions:
In this case study, we will demonstrate how to import and wrangle data available in the Portable Document Format (PDF). We will especially focus on using packages and functions from the Tidyverse, such as dplyr, ggplot2. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data science skills:
magick packagedplyr for data wranglingtidyr)dplyrggplot2 that are in a similar style to an existing imageStatistical concepts and methods:
We will begin by loading the packages that we will need:
library(here)
library(tidyverse)
library(pdftools)
library(magick)
library(cowplot)
library(Kendall)| Package | Use |
|---|---|
| here | to easily load and save data |
| tidyverse | for data science operations |
| pdftools | to manage PDF documents |
| magick | for image processing |
The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
So how does youth disconnection happen and what impact does it have?
There are many known risk factors, which have been identified in a variety of contexts (from family, friends, school, community, society) including:
These risk factors make it more likely for young people to miss out on education, training, and networking that can act as a foundation for a sucessful career.
There are also many known negative consequences associated with youth disconnection including but not limited to:
Photo by Jon Tyson on Unsplash
Furthermore, in 2012 it was estimated that each disconnected youth costs taxpayers $250000 during a life time due to lost tax revenue and costs for social sercices, heath care and criminal justice.
Youth disconnection can be described as a continuum, as some youths will be disconnected for a brief time, while others are chronically disconnected. Additionally, while an individual who is out of school and work and also has poor support from the realtionships of others may be further disconnected than an individual who has social support.
Here is an illustration of risk factors, protective factors and the continuum of disconnection:
##### [source]
Many programs have identified useful strategies in rengaging disconnected youth or preventing discconection of youth.
generally speaking, most programs focus on reengagement strategies, however, prevention strategies are likely to be just as important.
Reserach suggests that active involvement with at risk youth from infancy and across multiple developmental stages through young adulthood whould be the most beneficial.
In fact, the quality of parental caregiving of infants age 6-24 months has actually been shown to be a predictor of high school dropout rates! Thus early interventions may be very important and consistent continual engagement may prevent further disconnection of youths.
Prevention strategies include:
See here and here for listings of programs dedicated to rengaging disconnected youth or preventing disconnection.
See here and here for particular examples.
The statistics used in this section came from this article.
There are some important considerations regarding this data analysis to keep in mind:
This data used in the Measure of America project reports from the is derived from American Community Survey(ASC) which excludes or underrepresents certain opportunity youth groups, such as youths in the juvenile justice system, youths in the foster care system, and homeless youths as the survey is conducted on households. Furthermore, youths who may be more disconnected for other reasons besides not being in work or school, such as dealing with the added challenge of being a teenage mother, or being abused is not available in this dataset. Thus, this data likely underestimates youth disconnection rates.
Data about certain group intersections (meaning for example individuals of a particular gender and ethnicity) or particular groups in general such as specific ethnicities or gender or sexual identity groups such as LGBT (lesbian/gay/bisexual/transgender/queer and questioning) or nonbinary gender populations is unfortunately not available in the data used in this analysis and in most research about this topic. Luckily however, recent years of the ACS survey has more detailed infromation about a greater number of racial and ethnic groups and racial/ethnic intersections.
The statistical procedures we are using may be overly simplistic. In all data analysis, we need to be wary about deriving meaning from the statistical procedures we use.
Using image processing tools can be very helpful. The manner in which data is obtained with image processing tools is what we would describe as a black box process, a process with known inputs and outputs but unknown mechanics. Because we are unaware of how our outputs are generated from our inputs, we need to be wary of the output. With the small output we are creating in this case study, a visual inspection should suffice.
In this case study we will be using data related to youth disconnection from the two following reports from the Measure of America project:
Measure of America is a nonpartisan project of the nonprofit Social Science Research Council founded in 2007 to create easy-to-use yet methodologically sound tools for understanding well-being and opportunity in America. Through reports, interactive apps, and custom-built dashboards, Measure of America works with partners to breathe life into numbers, using data to identify areas of highest need, pinpoint levers for change, and track progress over time.
The data used in these reports comes from the American Community Survey(ASC), which is the largest survey conducted by the United States Census Bureau. The survey started in 2005 and collects data for 3.5 million households annually. Data is collected about ancestry, citizehsip, income, employment, disability among many other aspects. See here for more detailed information about the survey.
According to Wikipedia (https://en.wikipedia.org/wiki/American_Community_Survey){target="_blank"}:
Data is collected by internet, mail, telephone interviews and in-person interviews…About 95 percent of households across all response modes ultimately respond… ACS responses are confidential… and “immune from legal process”
It is a mandatory survey, it is governed by federal laws that could impose a fine of as much as $5,000 for refusing to participate.
One way to import data from a pdf is to use the pdftools packages.
From the output, it’s clear that a large amount of manipulation will be required to wrangle this data with the pdftools package.
pdf_data_example <- pdftools::pdf_data(here("docs", "Making_the_Connection.pdf"))
utils::head(pdf_data_example, 1)[[1]]
# A tibble: 51 x 6
width height x y space text
<int> <int> <int> <int> <lgl> <chr>
1 74 23 36 53 TRUE KRISTEN
2 54 23 115 53 FALSE LEWIS
3 31 11 36 79 TRUE Rupsha
4 20 11 69 79 TRUE Basu
5 2 11 94 79 TRUE |
6 34 11 101 79 TRUE REPORT
7 43 11 138 79 FALSE DESIGNER
8 34 11 36 92 TRUE Rebecca
9 31 11 73 92 TRUE Gluskin
10 2 11 109 92 TRUE |
# … with 41 more rows
pdf_tools_example <- pdftools::pdf_text(here("docs","Making_the_Connection.pdf"))
head(pdf_tools_example, 1)[1] "KRISTEN LEWIS\nRupsha Basu | REPORT DESIGNER\nRebecca Gluskin | CHIEF STATISTICIAN\nLaura Laderman | DATA ANALYST\nVikki Lassiter | STAKEHOLDER ENGAGEMENT\nBecky Ofrane | RESEARCHER\nBeth Post | REPORT DESIGNER\nMarina Recio | RESEARCHER\nMAKING THE CONNECTION\nTRANSPORTATION AND YOUTH DISCONNECTION\n MEA SUREOF AMERICA\n of the Social Science Research Council\n"
While not impossible, it’s hard not to argue that using the pdftools package in this scenario will require a lot of advanced data wrangling. While our output may be reproducible, the amount of time to achieve this reproducibility may not be any more efficient than typing the table by hand into a text or spreadsheet program.
Fortunately, there is another way we can proceed to wrangle the data.
We will demonstrate how to produce reproducible tables with image processing software in R.
For demonstrative purposes, we will import two sets of data. The first set of data will be used to highlight common errors that the image processing software may produce. The second set of data will be used to demonstrate how to circumvent these errors and produce reproducible datasets efficiently.
We will now import the data using the magick and here packages.
#bold is causing issues
image_example <- magick::image_read(here::here("img", "gender_race_ethnicity.png"))
magick::image_info(image_example)# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 997 711 sRGB TRUE 164964 72x72
We import the same data—this time without regions of the table without special formatting—using the magick and here packages.
#bold is causing issues
image1 <- image_read(here("img", "gender_race_ethnicity2.png"))
magick::image_info(image1)# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 1259 731 sRGB TRUE 188256 72x72
The first image we imported looks like this.
Parts of the table depicted in the image above contain newline characters. A newline character denotes the end of a line of text and the start of a new line of text.
This makes it more difficult to wrangle this table. Wrangling the data via regular expressions may be very tedious. It’s unlikely that image processing software can handle the newline characters—or any other special characters—correctly.
Using image_ocr() by magick on the image above will render some errors in the first few lines.
df1 <- magick::image_ocr(image_example)
df1 %>%
base::strsplit("\n") %>%
base::unlist() %>%
tibble::as_tibble()# A tibble: 21 x 1
value
<chr>
1 Youth Disconnection by Gender and by Race and Ethnicity
2 byte) ee
3 MAJOR RACIAL AND Cr erent) ed Nas
4 United States 126 147 14.1 13.21.7115 4,501,800 22.1
5 Male 12.3 15.2 14.5 13.3 12.1 11.8 2,382,500 -22.5
6 Female 12.9 14.1 13.7 13.0 11.2 11.1 2,119,400 -21.7
7 ASIAN 71 85 78 19 66 66 145,600 -21.7
8 Asian Male 6.3 8.3 74 7.2 6.7 65 73,000 -21.4
9 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7 72,600 -22.0
10 WHITE 97 210897 9.4 1,961,700 -20.1
# … with 11 more rows
If we remove these lines, some new errors emerge.
# A tibble: 18 x 1
value
<chr>
1 United States 126 147 14.1 13.21.7115 4,501,800 22.1
2 Male 12.3 15.2 14.5 13.3 12.1 11.8 2,382,500 -22.5
3 Female 12.9 14.1 13.7 13.0 11.2 11.1 2,119,400 -21.7
4 ASIAN 71 85 78 19 66 66 145,600 -21.7
5 Asian Male 6.3 8.3 74 7.2 6.7 65 73,000 -21.4
6 Asian Female 7.9 8.6 8.1 8.6 6.6 6.7 72,600 -22.0
7 WHITE 97 210897 9.4 1,961,700 -20.1
8 White Male 95 12.3 11.5 10.8 10.0 9.6 1,031,200 -22.4
9 White Female 10.0 11.1 10.8 10.7 9.4 91 930,600 -17.4
10 LATINO 167-185 17.3 15.2 13.7 =~ 138.2_—_—‘1,157,300 -28.7
11 Latino Male 13.6 16.8 16.0 14.0 12.6 12.4 562,600 -26.0
12 Latina Female 20.2 20.3 18.8 16.5 14.8 13.9 594,700 -31.5
13 BLACK 20.4 225 224 206 17.2 17.9 999,700 -20.6
14 Black Male 23.7 26.0 25.6 23.5 20.1 20.8 591,600 -19.8
15 Black Female 17.0 19.0 19.3 17.6 14.2 14.8 408,000 -22.1
16 NATIVE AMERICAN 264 288 27.0 23 28 23.9 67,700 -17.1
17 Native American Male 25.0 30.9 28.0 26.9 28.1 23.3 33,200 -24.5
18 Native American Female 23.9 26.7 25.9 25.6 23.4 24.5 34,500 -B.4
Data wrangling is not an exact science. The approaches we can take are extremely dependent on the data. We can exploit patterns in the data to render the output we desire.
We will now use a cropped version of the image above without the special formatting.
# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 1259 731 sRGB TRUE 188256 72x72
# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 1259 731 sRGB TRUE 188256 72x72
We have several strings. Each cell of data is on a string separated by space.
We separate each string by space.
We split the dataframe in two: a labels section and a “data” section containing the information we are interested in.
In the first half, we remove all digits and punctuation to ensure that we are left with character labels.
In the second-half, remove commas and periods, converting the resulting string character class to numeric and selectively multiplying columns to reintroduce the decimal point correctly
df1a <- df1 %>%
purrr::map(~base::paste(.,collapse = "")) %>%
purrr::map(~base::gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
base::do.call(base::rbind,.) %>%
base::data.frame() %>%
dplyr::tibble()
df1b <- map(df1, tail, 8) %>%
map(~gsub("[,]+|[.]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
dplyr::mutate_if(base::is.character, base::as.numeric) %>%
dplyr::mutate_at(vars(-X7), ~ . * 0.1) %>%
tibble()
base::rm(df1)We combine the two sections of data to create our dataframe, removing and then adding column names.
df1 <- dplyr::bind_cols(df1a,
df1b)
base::names(df1) <- c()
column_names <- c("Group",
"Perc_2008",
"Perc_2010",
"Perc_2012",
"Perc_2014",
"Perc_2016",
"Perc_2017",
"N_2017",
"Delta_perc")
base::colnames(df1) <- column_namesWe remove columns with information we don’t need and use the commmon pattern in the column names to convert the data into long/narrow format.
df1 <- df1 %>%
dplyr::select(-N_2017,
-Delta_perc)
df1 <- df1 %>%
tidyr::pivot_longer(cols=contains("Perc_"),
names_to = "Year",
values_to = "Rate",
names_prefix = "Perc_") %>%
dplyr::mutate(Year = as.numeric(Year))We use the dplyr::case_when() and stringr::str_detect() function to detect patterns and create an separate column with gender and race information.
The two columns created contain TRUE/FALSE statements. These are then used to create a third column that will allow us to separate the data by its summary level.
df1 <- df1 %>%
mutate(Gender = dplyr::case_when(stringr::str_detect(Group, "Female") ~ TRUE,
stringr::str_detect(Group, "Male") ~ FALSE,
TRUE ~ NA),
Race = stringr::str_remove_all(Group,
pattern = paste(c("Female","Male","UnitedStates"),
collapse = "|"))) %>%
mutate(Race = dplyr::na_if(Race, ""))
df1 <- df1 %>%
mutate(Type = case_when(base::is.na(Gender) & base::is.na(Race) ~ "US Total",
base::is.na(Gender) & !base::is.na(Race) ~ "Race Total",
!base::is.na(Gender) & base::is.na(Race) ~ "Gender Total",
TRUE ~ "Subgroup Total"))
df1 <- df1 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ "Female",
str_detect(Group, "Male") ~ "Male"))The columns of character class currently contain upper and lower case characters. We want to ensure that we use a cases consistently to ensure that there are no errors driven by case-sensitive function downstream.
To do this, we use the base::to_upper() function. This function makes all characters uppercase.
Finally, we homogenize the labels assigned for certain groups, fill in missing (NA) values with a string, and remove columns we no longer need from the dataframe.
df1 <- df1 %>%
mutate(Race = case_when(Race == "LATINA" ~ "LATINO/A",
Race == "LATINO" ~ "LATINO/A",
Race == "NATIVEAMERICAN" ~ "NATIVE AMERICAN",
TRUE ~ Race)) %>%
mutate(Gender = tidyr::replace_na(Gender, "ALL"),
Race = tidyr::replace_na(Race, "ALL")) %>%
dplyr::select(-Group)We can repeat this process for the other two tables listed on page 39.
Let’s look at the table without the special formatting.
As you can see, there are empty spaces. According to the PDF, these spaces are empty to denote that the estimates are unreliable.
This may cause problems. Whitespace must be handled differently. We may not want to process the entire image for this reason.
Instead, we can use separate images to ensure a simpler process like that above.
We read the three images.
image2a <- image_read(here("img", "asian_subgroupsA.png"))
image2b <- image_read(here("img", "asian_subgroupsB.png"))
image2c <- image_read(here("img", "asian_subgroupsC.png"))
magick::image_info(image2a)# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 757 687 sRGB TRUE 95660 72x72
# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 751 177 sRGB TRUE 28913 72x72
# A tibble: 1 x 7
format width height colorspace matte filesize density
<chr> <int> <int> <chr> <lgl> <int> <chr>
1 PNG 760 216 sRGB TRUE 33402 72x72
The images look like this.
We save the text from the images into objects.
df2a <- magick::image_ocr(image2a)
df2b <- magick::image_ocr(image2b)
df2c <- magick::image_ocr(image2c)We process these objects separately. Note that we use a very similar process to that employed in the wrangling of the previous table.
df2a <- df2a %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df2b <- df2b %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df2c <- df2c %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()We then combine the objects with the dplyr::bind_rows() function.
The process is now very similar to the previous table.
MICHAEL ONTIVEROS
I used base R to remove the first three rows of a dataframe in the following chunk. I am not aware of a tidyverse solution for this; I am sure one exists.
df2 <- bind_rows(df2a,
df2b,
df2c)
df2 <- df2 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")
df2b <- map(df2, tail, 2) %>%
map(~gsub("[,]+|[.]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
mutate_if(is.character, as.numeric) %>%
mutate_at(vars(-X2), ~ . * 0.1) %>%
tibble()
df2a <- df2 %>%
map(~paste(.,collapse = "")) %>%
map(~gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
tibble()
df2 <- bind_cols(df2a, df2b)
names(df2) <- c()
column_names <- c("Group",
"Rate",
"N_2017")
colnames(df2) <- column_names
df2 <- df2 %>%
dplyr::select(-N_2017)
df2 <- df2 %>%
dplyr::mutate(Year = 2017)
df2 <- df2 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ TRUE,
str_detect(Group, "Male") ~ FALSE,
TRUE ~ NA),
Subgroup = str_remove_all(Group,
pattern = paste(c("Female",
"Male",
"ASIAN",
"Asian"),
collapse = "|"))) %>%
mutate(Subgroup = na_if(Subgroup, ""))
glimpse(df2)Rows: 25
Columns: 5
$ Group <chr> "UnitedStates", "Male", "Female", "ASIAN", "AsianMale", "Asi…
$ Rate <dbl> 11.5, 11.8, 11.1, 6.6, 6.5, 6.7, 4.3, 4.7, 3.9, 5.5, 7.5, 3.…
$ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> "UnitedStates", NA, NA, NA, NA, NA, "CHINESE", "Chinese", "C…
df2 <- df2[-(1:3),]
df2 <- df2 %>%
mutate(Type = case_when(is.na(Gender) & is.na(Subgroup) ~ "Asian Total",
is.na(Gender) & !is.na(Subgroup) ~ "Subgroup Total",
!is.na(Gender) & is.na(Subgroup) ~ "Gender Total",
TRUE ~ "Subgroup Total"))
glimpse(df2)Rows: 22
Columns: 6
$ Group <chr> "ASIAN", "AsianMale", "AsianFemale", "CHINESE", "ChineseMale…
$ Rate <dbl> 6.6, 6.5, 6.7, 4.3, 4.7, 3.9, 5.5, 7.5, 3.4, 5.9, 4.1, 7.8, …
$ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> NA, NA, NA, "CHINESE", "Chinese", "Chinese", "VIETNAMESE", "…
$ Type <chr> "Asian Total", "Gender Total", "Gender Total", "Subgroup Tot…
df2 <- df2 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ "Female",
str_detect(Group, "Male") ~ "Male"))
df2 <- df2 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, toupper)
df2 <- df2 %>%
mutate(Gender = replace_na(Gender, "ALL"),
Subgroup = replace_na(Subgroup, "ALL")) %>%
dplyr::select(-Group) %>%
mutate(Subgroup)
df2 <- df2 %>%
mutate(Subgroup = case_when(Subgroup == "TWOORMORE" ~ "TWO OR MORE",
TRUE ~ Subgroup))Note that we took a process that had successfully worked for us and modified it slightly for separate, similarly-sourced data.
This is a common approach in data science. Often, the duration of the wrangling process can limit the depth of an analysis for practical reasons. Using tried methods can help reduce the time needed to wrangle data and allow time for other parts of an analysis.
Let’s add the 2018 data for this group.
We import the image.
As you can see, we have repeated newlines (\n). We can remove these with some simplex regex.
df2_2018 <- gsub('([\n])\\1+',
'\\1',
df2_2018)
df2_2018 <- gsub("[[:punct:]]+",
"",
df2_2018)
df2_2018 <- gsub(" i ",
"",
df2_2018)We proceed, making slight modifications to the process as needed.
The bold font appears to have caused a typos.
# A tibble: 17 x 1
value
<chr>
1 CHINESE 41 23300
2 Men 45 12500
3 Women 37 10800
4 INDIAN B44 21800
5 Men 47 10400
6 Women 61 11300
7 KOREAN 55 9000
8 Men 56 4700
9 Women 54 4300
10 VIETNAMESE 63 15300
11 Men 76 9000
12 Women 50 6400
13 FILIPINO 68 20800
14 Men 63 10000
15 Women74 10800
16 HMONG102 5300
17 CAMBODIAN 138 4200
We fix the typos.
df2_2018 <- df2_2018 %>%
strsplit("\n") %>%
unlist() %>%
gsub(" B44 ","54",.) %>%
gsub("Women74","Women 74",.) %>%
gsub("HMONG102","HMONG 102",.) %>%
as_tibble()
df2_2018 %>%
print(.,n = dim(.)[1])# A tibble: 17 x 1
value
<chr>
1 CHINESE 41 23300
2 Men 45 12500
3 Women 37 10800
4 INDIAN 54 21800
5 Men 47 10400
6 Women 61 11300
7 KOREAN 55 9000
8 Men 56 4700
9 Women 54 4300
10 VIETNAMESE 63 15300
11 Men 76 9000
12 Women 50 6400
13 FILIPINO 68 20800
14 Men 63 10000
15 Women 74 10800
16 HMONG 102 5300
17 CAMBODIAN 138 4200
We then continue as we would normally.
df2_2018 <- df2_2018 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")
df2_2018 <- lapply(df2_2018,
function(x) x[nchar(x) >= 1])
df2a_2018 <- df2_2018 %>%
map(~paste(.,collapse = "")) %>%
map(~gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
tibble()
df2b_2018 <- map(df2_2018, tail, 2) %>%
do.call(rbind,.) %>%
data.frame() %>%
mutate_if(is.character, as.numeric) %>%
mutate_at(vars(-X2), ~ . * 0.1) %>%
tibble()
rm(df2_2018)
df2_2018 <- bind_cols(df2a_2018,
df2b_2018)
names(df2_2018) <- c()
column_names <- c("Group",
"Rate",
"N_2017")
colnames(df2_2018) <- column_names
df2_2018 <- df2_2018 %>%
dplyr::select(-N_2017)
df2_2018 <- df2_2018 %>%
dplyr::mutate(Year = 2018)
df2_2018 <- df2_2018 %>%
mutate(Gender = case_when(str_detect(Group, "Women") ~ TRUE,
str_detect(Group, "Men") ~ FALSE,
TRUE ~ NA))
labels <- unlist(df2_2018[c(seq(1,15, by=3),16,17),1], use.names = FALSE)
dim(df2_2018)[1][1] 17
labels_3 <- c(rep(labels[1:5], each = dim(df2_2018)[1]/(length(labels)-2)),
"HMONG",
"CAMBODIAN")
df2_2018$Subgroup <- labels_3
df2_2018 <- df2_2018 %>%
mutate(Type = "Subgroup Total")
glimpse(df2_2018)Rows: 17
Columns: 6
$ Group <chr> "CHINESE", "Men", "Women", "INDIAN", "Men", "Women", "KOREAN…
$ Rate <dbl> 4.1, 4.5, 3.7, 5.4, 4.7, 6.1, 5.5, 5.6, 5.4, 6.3, 7.6, 5.0, …
$ Year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> "CHINESE", "CHINESE", "CHINESE", "INDIAN", "INDIAN", "INDIAN…
$ Type <chr> "Subgroup Total", "Subgroup Total", "Subgroup Total", "Subgr…
df2_2018 <- df2_2018 %>%
mutate(Gender = case_when(Gender == TRUE ~ "Female",
Gender == FALSE ~ "Male",
TRUE ~ "All"))
df2_2018 <- df2_2018 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, toupper)
df2_2018 <- df2_2018 %>%
dplyr::select(-Group)The dataframe we produced does not contain totals.
# A tibble: 17 x 5
Rate Year Gender Subgroup Type
<dbl> <dbl> <chr> <chr> <chr>
1 4.1 2018 ALL CHINESE SUBGROUP TOTAL
2 4.5 2018 MALE CHINESE SUBGROUP TOTAL
3 3.7 2018 FEMALE CHINESE SUBGROUP TOTAL
4 5.4 2018 ALL INDIAN SUBGROUP TOTAL
5 4.7 2018 MALE INDIAN SUBGROUP TOTAL
6 6.1 2018 FEMALE INDIAN SUBGROUP TOTAL
7 5.5 2018 ALL KOREAN SUBGROUP TOTAL
8 5.6 2018 MALE KOREAN SUBGROUP TOTAL
9 5.4 2018 FEMALE KOREAN SUBGROUP TOTAL
10 6.3 2018 ALL VIETNAMESE SUBGROUP TOTAL
11 7.6 2018 MALE VIETNAMESE SUBGROUP TOTAL
12 5 2018 FEMALE VIETNAMESE SUBGROUP TOTAL
13 6.8 2018 ALL FILIPINO SUBGROUP TOTAL
14 6.3 2018 MALE FILIPINO SUBGROUP TOTAL
15 7.4 2018 FEMALE FILIPINO SUBGROUP TOTAL
16 10.2 2018 ALL HMONG SUBGROUP TOTAL
17 13.8 2018 ALL CAMBODIAN SUBGROUP TOTAL
We can find these totals in the PDF directly and create rows as needed
We load the PDF.
We add the rows.
df2_2018 <- df2_2018 %>%
add_row(Rate = 6.2,
Year = 2018,
Gender = "ALL",
Subgroup = "ALL",
Type = "ASIAN TOTAL") %>%
add_row(Rate = 6.1,
Year = 2018,
Gender = "FEMALE",
Subgroup = "ALL",
Type = "GENDER TOTAL") %>%
add_row(Rate = 6.4,
Year = 2018,
Gender = "MALE",
Subgroup = "ALL",
Type = "GENDER TOTAL")We repeat this process again for Latino/a subgroups.
The table, without the special formatting, looks like this.
There are no whitespaces in this table.
We proceed using what we’ve learned while wrangling the first two tables.
df3 <- magick::image_ocr(image3)
df3 %>%
base::strsplit("\n") %>%
base::unlist() %>%
tibble::as_tibble()# A tibble: 17 x 1
value
<chr>
1 LATINO 13.2 1,157,300
2 Latino Male 12.4 562,600
3 Latina Female 13.9 594,700
4 SOUTH AMERICAN 8.4 37,600
5 South American Male 9.1 20,400
6 South American Female 17 17,200
7 CENTRAL AMERICAN 12.0 93,100
8 Central American Male 9.3 37,900
9 Central American Female 15.0 55,200
10 MEXICAN 13.3 762,400
11 Mexican Male 12.2 358,200
12 Mexican Female 14.4 404,200
13 OTHER LATINO 13.6 44,800
14 Other Latino Male 15.3 27,600
15 Other Latina Female 11.5 17,300
16 PR, DR, Cuban Female 15.7 114,500
17 PR, DR, Cuban Female 14.4 96,600
We are often presented with scenarios where stand-alone approaches are difficult or time-consuming.
It is always best to document the steps take to respond to these scenarios. Wrangling this third table is a prime example of this.
We are missing a row. Let’s manually add the row.
df3 <- df3 %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df3 <- df3 %>%
rbind("PR, DR, Cuban 15.1 211,200")We can now proceed as we did with the previous tables.
df3 <- df3 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")
df3b <- map(df3, tail, 2) %>%
map(~gsub("[,]+|[.]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
mutate_if(is.character, as.numeric) %>%
mutate_at(vars(-X2), ~ . * 0.1) %>%
tibble()
df3a <- df3 %>%
map(~paste(.,collapse = "")) %>%
map(~gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
tibble()
rm(df3)
df3 <- bind_cols(df3a, df3b)
names(df3) <- c()
column_names <- c("Group",
"Rate",
"N_2017")
colnames(df3) <- column_namesIf we look at the last few rows, we see that there is a typo. There are two female groups.
# A tibble: 6 x 3
Group Rate N_2017
<chr> <dbl> <dbl>
1 OTHERLATINO 13.6 44800
2 OtherLatinoMale 15.3 27600
3 OtherLatinaFemale 11.5 17300
4 PRDRCubanFemale 15.7 114500
5 PRDRCubanFemale 14.4 96600
6 PRDRCuban 15.1 211200
Sometimes when wrangling text data, we will come across a typo. We need to determine how to respond to the typo and make note of it. It’s often best to consult a secondary source to confirm that changes made are accurate.
For the purposes of this case study, we will assume that the first of the two rows represents male disconnection rates in the Latino/a subgroup; this would be consistent with the ordering of genders in the previous subgroups.
Let’s make the correction to the typo.
df3 <- df3 %>%
mutate(Group = case_when(Group == "PRDRCubanFemale" & N_2017 == 114500 ~ "PRDRCubanMale",
TRUE ~ Group))It looks like we’ve succesfully corrected the typo.
# A tibble: 6 x 3
Group Rate N_2017
<chr> <dbl> <dbl>
1 OTHERLATINO 13.6 44800
2 OtherLatinoMale 15.3 27600
3 OtherLatinaFemale 11.5 17300
4 PRDRCubanMale 15.7 114500
5 PRDRCubanFemale 14.4 96600
6 PRDRCuban 15.1 211200
We can continue with the process we’ve developed now that we have made the correction.
df3 <- df3 %>%
dplyr::select(-N_2017)
df3 <- df3 %>%
dplyr::mutate(Year = 2017)
df3 <- df3 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ TRUE,
str_detect(Group, "Male") ~ FALSE,
TRUE ~ NA),
Subgroup = str_remove_all(Group,
pattern = paste(c("Female",
"Male",
"LATINO",
"Latino",
"Latina"),
collapse = "|"))) %>%
mutate(Subgroup = na_if(Subgroup, ""))
glimpse(df3)Rows: 18
Columns: 5
$ Group <chr> "LATINO", "LatinoMale", "LatinaFemale", "SOUTHAMERICAN", "So…
$ Rate <dbl> 13.2, 12.4, 13.9, 8.4, 9.1, 1.7, 12.0, 9.3, 15.0, 13.3, 12.2…
$ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> NA, NA, NA, "SOUTHAMERICAN", "SouthAmerican", "SouthAmerican…
df3 <- df3 %>%
mutate(Type = case_when(is.na(Gender) & is.na(Subgroup) ~ "Latino/a Total",
is.na(Gender) & !is.na(Subgroup) ~ "Subgroup Total",
!is.na(Gender) & is.na(Subgroup) ~ "Gender Total",
TRUE ~ "Subgroup Total"))
glimpse(df3)Rows: 18
Columns: 6
$ Group <chr> "LATINO", "LatinoMale", "LatinaFemale", "SOUTHAMERICAN", "So…
$ Rate <dbl> 13.2, 12.4, 13.9, 8.4, 9.1, 1.7, 12.0, 9.3, 15.0, 13.3, 12.2…
$ Year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> NA, NA, NA, "SOUTHAMERICAN", "SouthAmerican", "SouthAmerican…
$ Type <chr> "Latino/a Total", "Gender Total", "Gender Total", "Subgroup …
df3 <- df3 %>%
mutate(Gender = case_when(str_detect(Group, "Female") ~ "Female",
str_detect(Group, "Male") ~ "Male"))
df3 <- df3 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, toupper)
df3 <- df3 %>%
mutate(Gender = replace_na(Gender, "ALL"),
Subgroup = replace_na(Subgroup, "ALL")) %>%
dplyr::select(-Group) %>%
mutate(Subgroup)
df3 <- df3 %>%
mutate(Subgroup = case_when(Subgroup == "SOUTHAMERICAN" ~ "SOUTH AMERICAN",
Subgroup == "CENTRALAMERICAN" ~ "CENTRAL AMERICAN",
Subgroup == "PRDRCUBAN" ~ "PR/DR/CUBAN",
TRUE ~ Subgroup))Let’s add the 2018 data to this dataframe.
We import the image.
image5 <- image_read(here("img", "latino_a_subgroups_2018.png"))
df3_2018 <- magick::image_ocr(image5)As you can see, we have repeated newlines (\n). We can remove these with some simplex regex.
We proceed, making slight modifications to the process as needed.
df3_2018 <- df3_2018 %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df3_2018 <- df3_2018 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")
df3_2018 <- lapply(df3_2018,
function(x) x[nchar(x) >= 1])
df3a_2018 <- df3_2018 %>%
map(~paste(.,collapse = "")) %>%
map(~gsub("[[:digit:]]+|[[:punct:]]+", "",.)) %>%
do.call(rbind,.) %>%
data.frame() %>%
tibble()
df3b_2018 <- map(df3_2018, tail, 2) %>%
do.call(rbind,.) %>%
data.frame() %>%
mutate_if(is.character, as.numeric) %>%
mutate_at(vars(-X2), ~ . * 0.1) %>%
tibble()
rm(df3_2018)
df3_2018 <- bind_cols(df3a_2018,
df3b_2018)
names(df3_2018) <- c()
column_names <- c("Group",
"Rate",
"N_2017")
colnames(df3_2018) <- column_names
df3_2018 <- df3_2018 %>%
dplyr::select(-N_2017)
df3_2018 <- df3_2018 %>%
dplyr::mutate(Year = 2018)
df3_2018 <- df3_2018 %>%
mutate(Gender = case_when(str_detect(Group, "Women") ~ TRUE,
str_detect(Group, "Men") ~ FALSE,
TRUE ~ NA))
labels <- unlist(df3_2018[seq(1,12, by =3),1], use.names = FALSE)
dim(df3_2018)[1][1] 12
labels_3 <- rep(labels, each = dim(df3_2018)[1]/length(labels))
df3_2018$Subgroup <- labels_3
df3_2018 <- df3_2018 %>%
mutate(Type = "Subgroup Total")
glimpse(df3_2018)Rows: 12
Columns: 6
$ Group <chr> "SOUTHAMERICAN", "Men", "Women", "MEXICAN", "Men", "Women", …
$ Rate <dbl> 8.0, 7.5, 8.6, 12.9, 12.0, 13.8, 13.7, 14.9, 12.4, 13.7, 11.…
$ Year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
$ Gender <lgl> NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE, TRUE, NA, FALSE…
$ Subgroup <chr> "SOUTHAMERICAN", "SOUTHAMERICAN", "SOUTHAMERICAN", "MEXICAN"…
$ Type <chr> "Subgroup Total", "Subgroup Total", "Subgroup Total", "Subgr…
df3_2018 <- df3_2018 %>%
mutate(Gender = case_when(Gender == TRUE ~ "Female",
Gender == FALSE ~ "Male",
TRUE ~ "All"))
df3_2018 <- df3_2018 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, toupper)
df3_2018 <- df3_2018 %>%
dplyr::select(-Group)
df3_2018 <- df3_2018 %>%
mutate(Subgroup = case_when(Subgroup == "SOUTHAMERICAN" ~ "SOUTH AMERICAN",
Subgroup == "CENTRALAMERICAN" ~ "CENTRAL AMERICAN",
Subgroup == "PRDRCUBAN" ~ "PR/DR/CUBAN",
TRUE ~ Subgroup))We load the PDF.
We add the rows.
df3_2018 <- df3_2018 %>%
add_row(Rate = 12.8,
Year = 2018,
Gender = "ALL",
Subgroup = "ALL",
Type = "LATINO/A TOTAL") %>%
add_row(Rate = 13.3,
Year = 2018,
Gender = "FEMALE",
Subgroup = "ALL",
Type = "GENDER TOTAL") %>%
add_row(Rate = 12.3,
Year = 2018,
Gender = "MALE",
Subgroup = "ALL",
Type = "GENDER TOTAL")We add the 2018 data to the dataframe
We will use multiple images to import the data on page 36 to produce maps.
MICHAEL ONTIVEROS
This code is not complete. If there is time, we will return to it! Magick is having trouble with quadrant 1 and 4. I could not figure out why.
quadrant1 <- image_read(here("img", "state_quadrant1.png"))
quadrant2 <- image_read(here("img", "state_quadrant2.png"))
quadrant3 <- image_read(here("img", "state_quadrant3.png"))
quadrant4 <- image_read(here("img", "state_quadrant4.png"))
quadrant1 <- magick::image_ocr(quadrant1)
quadrant2 <- magick::image_ocr(quadrant2)
quadrant3 <- magick::image_ocr(quadrant3)
quadrant4 <- magick::image_ocr(quadrant4)
labels_quad1_3 <- paste0(quadrant1, quadrant3)
labels_quad2_4 <- paste0(quadrant2, quadrant4)
labels_quad1_3 <- labels_quad1_3 %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
labels_quad2_4 <- labels_quad2_4 %>%
strsplit("\n") %>%
unlist() %>%
as_tibble()
df1 <- df1 %>%
dplyr::select(value) %>%
pull(value) %>%
str_split(" ")data_map <- map_data("state") %>%
filter(region=="california")
years <- c(seq(2008,2016,by=2),2017)
index_rep <- dim(data_map)[1]
data_map <- bind_rows(replicate(length(years), data_map, simplify = FALSE))
data_map$year <- rep(years, each = index_rep)
data_map <- data_map %>%
group_by(region, year) %>%
mutate(rank_ran = rank(year, ties.method = "random"))
data_map <- data_map[order(data_map$order),]
library(gganimate)
ggplot(data_map, aes(x = long, y = lat, group = group, fill=rank_ran)) +
geom_polygon() +
transition_time(time = year)Repeated Cross-sectional Data
We have pooled (repeated) cross-sectional data.
This is data produced from repeated measurement of a population over time.
It is often infeasible to collect data for an entire population at once. However, we can still obtain meaningful measures using a random sample of the population.
At specific time-points, data is collected from a sample of the population. The individuals in each sample are not necessarily the same individuals. This separates pooled cross-sectional data from panel data, which is longitudinal data from repeated measurement of the same people.
By sampling from a population at multiple time points, we can generate population level statistics. Although these statistics have some random error, they can provide insight into how the measure variable is changing in a population over time.
We can accomplish this by plotting the measured values over time. Sometimes, however, the trend isn’t exactly clear. Fortunately, there are statistical methods to resolve this issue.
The Mann-Kendall trend test—a variation of the Kendall rank correlation coefficient—tests whether there is a monotonic association, an association that does not increase or decrease but remains static across a dimension.
Recall the youth disconnection rates for Native Americans, some of the highest in the first table we examined.
Let’s conduct a Mann-Kendall test for trend.
We can accomplish this with the Kendall::MannKendall() function. The Kendall::MannKendall() accepts a vector of data for which a trend may be observed. Consulting the documentation for the Kendall::MannKendall() function available on CRAN, we can “test for a a monotonic trend in a time series”.
df1 %>%
filter(Gender == "ALL",
Race == "NATIVE AMERICAN") %>%
pull(Rate) %>%
MannKendall(.) %>%
summary()Score = -7 , Var(Score) = 28.33333
denominator = 15
tau = -0.467, 2-sided pvalue =0.25966
There does not appear to be a change in the trend. However, it’s important to note that we only have 6 observations.
We can also explore the trend using simple linear regression.
df1 %>%
filter(Gender == "ALL",
Race == "NATIVE AMERICAN") %>%
lm(Rate ~ Year, data = .) %>%
summary()
Call:
lm(formula = Rate ~ Year, data = .)
Residuals:
1 2 3 4 5 6
-2.4332 2.2978 0.8288 0.4597 0.2907 -1.4438
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 359.1159 487.3956 0.737 0.502
Year -0.1655 0.2421 -0.683 0.532
Residual standard error: 1.889 on 4 degrees of freedom
Multiple R-squared: 0.1045, Adjusted R-squared: -0.1193
F-statistic: 0.467 on 1 and 4 DF, p-value: 0.5319
For each one year change, the mean increase in disconnection rates is -0.1654795. This relationship is not statistically significant. Again, we are largely limited by the number of observations in this dataset.
We can visualize the relationship above.
df1 %>%
filter(Gender == "ALL",
Race == "NATIVE AMERICAN") %>%
ggplot(aes(x = Year, y = Rate)) +
geom_smooth(method = "lm", color = "red") +
geom_point() +
scale_x_continuous(breaks = seq(2008, 2018, by = 1),
labels = seq(2008, 2018, by = 1),
limits = c(2008, 2018)) +
theme_minimal() +
labs(title = "Youth Disconnection Rates of Native American Youth",
subtitle = "2008 - 2017",
x = "Year",
y = "Disconnection Rate")As we can see, there is a large amount of uncertainty around the fitted line.
Let’s visualize the data!
Let’s reproduce the example below.
We can create a version of the above example with ggplot from tidyverse.
There are color identifying websites only such as this.
Using one of these websites, we identify the hex triplet for the color used in the visualization included in the PDF: #008393.
fa_figurine <- image_read("https://upload.wikimedia.org/wikipedia/commons/7/7c/User_font_awesome.svg")
fa_figurine <- image_fill(fa_figurine,
color = "#008393",
point = "+800+800",
fuzz = 0)
fa_figurine <- image_fill(fa_figurine,
color = "#008393",
point = "+800+1000",
fuzz = 0)
df1 %>%
filter(Type == "RACE TOTAL") %>%
ggplot(aes(x = Year, y = Rate, group=Race)) +
geom_line(color = "#008393", size = 0.5) +
geom_point(color = "#008393", size = 3) +
scale_x_continuous(breaks = seq(2008,2018, by=1),
limits = c(2008,2018)) +
scale_y_continuous(breaks = seq(5,30, by =5),
limits = c(5,30)) +
draw_image(fa_figurine, x = 2017, y = 23.5, scale = 2) +
draw_image(fa_figurine, x = 2017, y = 17.5, scale = 2) +
draw_image(fa_figurine, x = 2017, y = 13, scale = 2) +
draw_image(fa_figurine, x = 2017, y = 9, scale = 2) +
draw_image(fa_figurine, x = 2017, y = 6.5, scale = 2) +
labs(title = "FIGURE 1 YOUTH DISCONNECTION BY RACE AND ETHNICITY, 2008 - 2017",
y = "YOUTH DISCONNECTION (%)") +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank())We can build off of this idea, using a custom color palette to create a gradient based off the color used.
custom_pal <- colorRampPalette(c("white", "#008393"))
gender_n <- 3
asian_total <- df2 %>%
filter(Year == 2017) %>%
filter(Gender == "ALL",
Subgroup == "ALL") %>%
pull(Rate)
df2 %>%
filter(Year == 2017) %>%
complete(Gender, Subgroup) %>%
group_by(Gender) %>%
mutate(sub_rank = rank(Rate, ties.method = "min")) %>%
group_by(Subgroup) %>%
mutate(rank_all = sub_rank[Gender == "ALL"]) %>%
ungroup() %>%
mutate(Subgroup = fct_reorder(Subgroup, rank_all)) %>%
ggplot(aes(x = Subgroup, y = Rate, fill = Gender)) +
geom_hline(yintercept = asian_total,
color = "black",
linetype = 2) +
geom_bar(stat = "identity",
color = "transparent",
size = 0.5,
position = "dodge",
width = 0.5) +
labs(title = "FIGURE X YOUTH DISCONNECTION BY ASIAN SUBGROUP, 2017",
subtitle = "ORDERED BY OVERALL DISCONNECTION",
y = "YOUTH DISCONNECTION (%)",
fill = "Gender") +
scale_fill_manual(values = rev(custom_pal(gender_n + 1))) +
scale_y_continuous(breaks = seq(0,20,2),
labels = seq(0,20,2),
limits = c(0,20)) +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1)) +
annotate("text", label = 'bold("ASIAN TOTAL")',
color = "#008393",
size = 3,
x = 1.2,
y = asian_total + 1,
parse = TRUE)From the above plot, it becomes apparent that the Hmong subgroup produces a small proportion of the total number of asian disconnected youth. The Asian total youth disconnection rate is more alike the youth disconnection rates for all other subgroups than the Hmong youth disconnection rate.
We can confirm this by revisiting the table.
The Hmong group represents 6% of all Asian disconnected youth.
This shows the importance of adding small details such as the composite line to plots. It helps provide a simple yet nuanced picture of what is going on.
Lastly, we can add annotations to add provide even more depth to the visualization.
latino_a_total <- df3 %>%
filter(Year == 2017) %>%
filter(Gender == "ALL",
Subgroup == "ALL") %>%
pull(Rate)
df3 %>%
filter(Year == 2017) %>%
complete(Gender, Subgroup) %>%
group_by(Gender) %>%
mutate(sub_rank = rank(Rate, ties.method = "min")) %>%
group_by(Subgroup) %>%
mutate(rank_all = sub_rank[Gender == "ALL"]) %>%
ungroup() %>%
mutate(Subgroup = fct_reorder(Subgroup, rank_all)) %>%
ggplot(aes(x = Subgroup, y = Rate, fill = Gender)) +
geom_hline(yintercept = latino_a_total,
color = "black",
linetype = 2) +
geom_bar(stat = "identity",
color = "transparent",
size = 0.5,
position = "dodge",
width = 0.5) +
labs(title = "FIGURE X YOUTH DISCONNECTION BY LATINO/A SUBGROUP, 2017",
subtitle = "ORDERED BY OVERALL DISCONNECTION",
y = "YOUTH DISCONNECTION (%)",
fill = "Gender") +
scale_fill_manual(values = rev(custom_pal(gender_n + 1))) +
scale_y_continuous(breaks = seq(0,20,2),
labels = seq(0,20,2),
limits = c(0,20)) +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1)) +
annotate("text", label = 'bold("LATINO TOTAL")',
color = "#008393",
size = 3,
x = 1.2,
y = latino_a_total + 1,
parse = TRUE)df2 %>%
complete(Gender, Subgroup, Year) %>%
group_by(Subgroup) %>%
mutate(missing = sum(is.na(Rate))) %>%
filter(missing == 0) %>%
dplyr::select(-missing) %>%
ungroup() %>%
group_by(Gender) %>%
mutate(sub_rank = rank(Rate, ties.method = "min")) %>%
group_by(Subgroup, Year) %>%
mutate(rank_all = sub_rank[Gender == "ALL"]) %>%
ungroup() %>%
group_by(Year) %>%
mutate(threshold = Rate[Gender == "ALL" & Subgroup == "ALL"]) %>%
ungroup() %>%
mutate(Subgroup = fct_reorder(Subgroup, rank_all)) %>%
ggplot(aes(x = Subgroup, y = Rate, fill = Gender)) +
facet_wrap(Year ~., ncol = 1) +
geom_hline(aes(yintercept = threshold), linetype = 2) +
geom_bar(stat = "identity",
color = "transparent",
size = 0.5,
position = "dodge",
width = 0.5) +
labs(title = "FIGURE X YOUTH DISCONNECTION BY ASIAN SUBGROUP, 2017-2018",
subtitle = "ORDERED BY OVERALL DISCONNECTION",
y = "YOUTH DISCONNECTION (%)",
fill = "Gender") +
scale_fill_manual(values = rev(custom_pal(gender_n + 1))) +
scale_y_continuous(breaks = seq(0,10,2),
labels = seq(0,10,2),
limits = c(0,10)) +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1))df3 %>%
complete(Gender, Subgroup, Year) %>%
group_by(Subgroup) %>%
mutate(missing = sum(is.na(Rate))) %>%
filter(missing == 0) %>%
dplyr::select(-missing) %>%
ungroup() %>%
group_by(Gender) %>%
mutate(sub_rank = rank(Rate, ties.method = "min")) %>%
group_by(Subgroup, Year) %>%
mutate(rank_all = sub_rank[Gender == "ALL"]) %>%
ungroup() %>%
group_by(Year) %>%
mutate(threshold = Rate[Gender == "ALL" & Subgroup == "ALL"]) %>%
ungroup() %>%
mutate(Subgroup = fct_reorder(Subgroup, rank_all)) %>%
ggplot(aes(x = Subgroup, y = Rate, fill = Gender)) +
facet_wrap(Year ~., ncol = 1) +
geom_hline(aes(yintercept = threshold), linetype = 2) +
geom_bar(stat = "identity",
color = "transparent",
size = 0.5,
position = "dodge",
width = 0.5) +
labs(title = "FIGURE X YOUTH DISCONNECTION BY LATINO/A SUBGROUP, 2017",
subtitle = "ORDERED BY OVERALL DISCONNECTION",
y = "YOUTH DISCONNECTION (%)",
fill = "Gender") +
scale_fill_manual(values = rev(custom_pal(gender_n + 1))) +
scale_y_continuous(breaks = seq(0,20,2),
labels = seq(0,20,2),
limits = c(0,20)) +
theme_classic() +
theme(title = element_text(size = 10,
color = "#008393",
face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1))avocadoThis concepts listed here must be revisited.
Terms and concepts covered:
Tidyverse
RStudio cheatsheets
Inference
Regression
Different types of regression
Ordinary least squares method
Residual
Packages used in this case study:
| Package | Use |
|---|---|
| here | to easily load and save data |
| tidyverse | for data science operations |
| pdftools | to manage PDF documents |
| magick | for image processing |
We would like to acknowledge Tamar Mendelson for assisting in framing the major direction of the case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.